Spam Detection Using Character N-Grams

نویسندگان

  • Ioannis Kanaris
  • Konstantinos Kanaris
  • Efstathios Stamatatos
چکیده

This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation which avoids the sparse data problem that arises in n-grams on the word-level. Moreover, it is language-independent and does not require any lemmatizer or 'deep' text preprocessing. Based on experiments on Ling-Spam corpus we evaluate the proposed representation in combination with support vector machines. Both binary and term-frequency representations achieve high precision rates while maintaining recall on equally high level, which is a crucial factor for anti-spam filters, a cost sensitive application.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Detection of Opinion Spam with Character n-grams

In this paper we consider the detection of opinion spam as a stylistic classification task because, given a particular domain, the deceptive and truthful opinions are similar in content but differ in the way opinions are written (style). Particularly, we propose using character ngrams as features since they have shown to capture lexical content as well as stylistic information. We evaluated our...

متن کامل

Words vs. Character N-grams for Anti-spam Filtering

The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpe...

متن کامل

Efficient In-memory Data Structures for n-grams Indexing

Indexing n-gram phrases from text has many practical applications. Plagiarism detection, comparison of DNA of sequence or spam detection. In this paper we describe several data structures like hash table or B+ tree that could store n-grams for searching. We perform tests that shows their advantages and disadvantages. One of neglected data structure for this purpose, ternary search tree, is deep...

متن کامل

Privacy Preserving Spam Filtering

Email is a private medium of communication, and the inherent privacy constraints form a major obstacle in developing effective spam filtering methods which require access to a large amount of email data belonging to multiple users. To mitigate this problem, we envision a privacy preserving spam filtering system, where the server is able to train and evaluate a logistic regression based spam cla...

متن کامل

Feature Representation for Effective Action-Item Detection

E-mail users face an ever-growing challenge in managing their inboxes due to the growing centrality of email in the workplace for task assignment, action requests, and other roles beyond information dissemination. Whereas Information Retrieval and Machine Learning techniques are gaining initial acceptance in spam filtering and automated folder assignment, this paper reports on a new task: autom...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006